-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance the formatting for Column #11724
Conversation
Expr::Alias(alias) => alias.expr.display_name(), | ||
_ => expr.display_name(), | ||
} | ||
.ok()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether this makes sense, but we need a name for the function dependencies, so I chose a name that wouldn't be affected by Display.
By the way, I'm confused about why we have so many different names or similar display methods for Expr. Maybe we should organize them or name them more clearly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#11782 filed
datafusion/common/src/column.rs
Outdated
@@ -372,7 +372,7 @@ impl FromStr for Column { | |||
|
|||
impl fmt::Display for Column { | |||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | |||
write!(f, "{}", self.flat_name()) | |||
write!(f, "{}", self.quoted_flat_name()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought identifier are the one to be double quoted in your ticket 😕, but it seems you are modifying column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, an identifier is an unqualified column. Are they different? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
identifier could includes dot, that is the reason we need double quote to differentiate it from qualified column. Identifier with dot is relatively rare compare to column, so we don't have double quote every where.
In duckdb, they also has double quote for alias name (identifier).
D select t.a from t;
┌───────┐
│ a │
│ int32 │
├───────┤
│ 1 │
└───────┘
D select t.a as t.a from t;
Parser Error: syntax error at or near "."
LINE 1: select t.a as t.a from t;
^
D select t.a as "t.a" from t;
┌───────┐
│ t.a │
│ int32 │
├───────┤
│ 1 │
└───────┘
If we want to introduce dot notation for struct a.b
, it is consistent to duckdb if we don't have double quote for column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think only adding double quote for identifier that includes dot is probably a better idea.
If we have column that has name includes dot, we display it like qualifier."name.with.dot"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
identifier could includes dot, that is the reason we need double quote to differentiate it from qualified column. Identifier with dot is relatively rare compare to column, so we don't have double quote every where.
Agreed. That makes sense to me. A Column
comprises multiple identifiers: catalog.schema.table.column
. All the parts are identifiers
. However, because we don't have an Expr
for an Identifier
, I can only handle it for Column
. Currently, we represent an identifier as an unqualified column.
datafusion/datafusion/expr/src/expr_fn.rs
Lines 90 to 92 in 0332eb5
pub fn ident(name: impl Into<String>) -> Expr { | |
Expr::Column(Column::from_name(name)) | |
} |
Ideally, all of the names
should be identifiers (e.g., table name, column name, function name, or alias) that should be quoted if needed. This purpose will impact many parts.
I think only adding double quote for identifier that includes dot is probably a better idea.
If we have column that has name includes dot, we display it like
qualifier."dot.column"
I believe quoted_flat_name()
that will invoke quote_identifier
handles this logic well.
datafusion/datafusion/common/src/utils/mod.rs
Lines 260 to 280 in 0332eb5
pub fn quote_identifier(s: &str) -> Cow<str> { | |
if needs_quotes(s) { | |
Cow::Owned(format!("\"{}\"", s.replace('"', "\"\""))) | |
} else { | |
Cow::Borrowed(s) | |
} | |
} | |
/// returns true if this identifier needs quotes | |
fn needs_quotes(s: &str) -> bool { | |
let mut chars = s.chars(); | |
// first char can not be a number unless escaped | |
if let Some(first_char) = chars.next() { | |
if !(first_char.is_ascii_lowercase() || first_char == '_') { | |
return true; | |
} | |
} | |
!chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_') | |
} |
This function doesn't always quote the identifier. Only quote it if it contains invalid characters, such as a dot, bracket, or uppercase letter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think only adding double quote for identifier that includes dot is probably a better idea.
If we have column that has name includes dot, we display it like
qualifier."name.with.dot"
I'm rethinking what you said. I did a test for the function name using the original method (flat_name).
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("MAX(id)", DataType::Int32, false),
Field::new("state", DataType::Utf8, false),
]);
let plan = table_scan(Some("t"), &schema, None)
.unwrap()
.filter(col("state").eq(lit("CO")))
.unwrap()
.project(vec![col("MAX(id)"), max(col("id"))])
.unwrap()
.build()
.unwrap();
-----
right: "Projection: t.MAX(id), MAX(t.id)\n Filter: t.state = Utf8(\"CO\")\n TableScan: t"
We can distinguish between them easily.
Indeed, maybe we don't need to quote the identifier if it contains a bracket. Only quoting it for the dot might be simpler. 🤔
57d9801
to
f6d6a6f
Compare
fn quoted_flat_name_if_contain_dot(&self) -> String { | ||
match &self.relation { | ||
Some(r) => format!( | ||
"{}.{}", | ||
table_reference_to_quoted_string(r), | ||
quoted_if_contain_dot(&self.name) | ||
), | ||
None => quoted_if_contain_dot(&self.name).to_string(), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 Before I fix other tests, I want to check if this behavior makes sense. (It involves too many tests 😢 ).
Now, we only quote an identifier if it contains the dot. However, some cases like sum(t1.c1)
will also be quoted, even if it's a function call. I think it's not worth doing more checking to exclude this kind of case. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not ideal if sum(t1.c1)
is quoted 🤔 . I hope the change is as small as possible, so I would prefer to keep function or others Expr remain the same, only identifier with dot is quoted.
We could also hold on and wait for more input from other's about the change of this, given the change of this is not trivial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of modifying Column
, we should modify the display_name
for Expr
, so if we found column inside ScalarFunction, we could skip the double quote anyway. (by something like boolean flag?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of modifying
Column
, we should modify thedisplay_name
forExpr
, so if we found column inside ScalarFunction, we could skip the double quote anyway. (by something like boolean flag?)
I'm not sure if it's that simple 🤔. In my experience, the column might look like this:
Column { qualifier: None, name: "sum(t1.c1)" }
I think it's hard to find a consistent pattern for it because we use many Column::from_name
calls to create projections. For example, in
.map(|x| Column::from_name(self.ident_normalizer.normalize(x))) |
the column name could be complex and unruly.
I rethink about what the issue we are solving, so I would like to make sure we have such issue in practice. Is there any valid query that shows the ambiguity between the identifier and column?
I guess it is not possible to project with the same name, thus the issue you mentioned probably not exist 🤔 ? |
Indeed, I think the final plan generated by the planner is fine. I tried many SQL cases, and none of them generated the issue I mentioned. I encountered this issue when I used LogicalPlanBuilder to build the plan directly while implementing some analyzer rules (e.g., the expand wildcard rule) to transform the plans. When I print the intermediate plan to debug or check something, I get confused about the columns if they aren't qualified. |
I noticed that #11774 changes the debugging behavior of the plan. I guess it might improve the issue I encountered. |
What is the status of this PR? CI seems not to be passing and #11724 (comment) suggests maybe you don't think it is needed anymore @goldmedal ? Marking it as a draft to signify it isn't waiting on review anymore |
Yes, this change impacts too many parts, and it won't occur in any valid query (#11724 (comment)). #11774 has resolved the issue that occurred. I think we can close this PR. |
Thanks anyways for trying to make the code better @goldmedal |
Which issue does this PR close?
No corresponding issue.
Rationale for this change
Consider the following case:
Before this change, if we try to format
Column
, the result is:We can't distinguish between
col("t.id")
andident("t.id")
.Now, the result will be:
This change is beneficial for debugging.
What changes are included in this PR?
Are these changes tested?
yes
Are there any user-facing changes?
no